Partitioning strategies for distributed association rule mining
Abstract
In this paper a number of alternative strategies for distributed/parallel association rule mining are investigated. The methods examined make use of a data structure, the T-tree, introduced previously by the authors as a structure for organising sets of attributes for which support is being counted. We consider six different approaches, representing different ways of parallelising the basic Apriori-T algorithm that we use. The methods focus on different mechanisms for partitioning the data between processes, and for reducing the message-passing overhead. Both ‘horizontal’ (data distribution) and ‘vertical’ (candidate distribution) partitioning strategies are considered, including a vertical partitioning algorithm (DATA-VP) which we have developed to exploit the structure of the T-tree. We present experimental results examining the performance of the methods in implementations using JavaSpaces. We conclude that in a JavaSpaces environment, candidate distribution strategies offer better performance than those that distribute the original dataset, because of the lower messaging overhead, and the DATA-VP algorithm produced results that are especially encouraging.
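The distinction between the two partitioning families can be illustrated with a minimal Python sketch. It is an illustration only: it does not reproduce Apriori-T, the T-tree, DATA-VP, or JavaSpaces messaging, the "workers" are simulated sequentially in one process, and all function names and the toy dataset are hypothetical. In this simplified version each vertical worker scans the full dataset, which is an assumption of the sketch rather than a description of the authors' algorithm.

```python
from collections import Counter
from itertools import combinations


def count_support(transactions, candidates):
    """Count how many transactions contain each candidate itemset."""
    counts = Counter()
    for t in transactions:
        for c in candidates:
            if c <= t:  # candidate is a subset of the transaction
                counts[c] += 1
    return counts


def horizontal_counts(transactions, candidates, n_workers):
    """Data distribution: each simulated worker receives a slice of the
    transactions, counts every candidate, and partial counts are summed."""
    total = Counter()
    for w in range(n_workers):
        total.update(count_support(transactions[w::n_workers], candidates))
    return total


def vertical_counts(transactions, candidates, n_workers):
    """Candidate distribution: each simulated worker owns a slice of the
    candidates; the per-worker results are disjoint and simply merged."""
    total = Counter()
    for w in range(n_workers):
        total.update(count_support(transactions, candidates[w::n_workers]))
    return total


# Toy dataset (hypothetical), with all 2-itemsets as candidates.
transactions = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}, {"a", "b", "c"},
)]
items = sorted(set().union(*transactions))
candidates = [frozenset(p) for p in combinations(items, 2)]

h = horizontal_counts(transactions, candidates, n_workers=2)
v = vertical_counts(transactions, candidates, n_workers=2)

for itemset in candidates:
    print(sorted(itemset), h[itemset], v[itemset])
```

Both functions produce the same support counts; they differ in what must be exchanged. The horizontal variant has to combine a partial count for every candidate from every worker, whereas the vertical variant only collects each worker's finished counts for its own candidates.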
Similar resources
Optimization of Distributed Association Rule Mining Based Partial Vertical Partitioning
Association rule mining is one of the most important techniques in data mining. Data mining is the process of analyzing data from different angles and extracting useful information from it. Modern organizations are geographically distributed. Using traditional centralized association rule mining to discover useful patterns in such distributed systems is not always feasible because merging dat...
A Novel Data Partitioning Approach for Association Rule Mining on Grids
Mining association rules refers to extracting useful knowledge from large databases. Algorithms for this technique are both data- and computation-intensive, which makes grid platforms very attractive for them. However, to exploit these platforms, new data partitioning features are required that take into account the specifics of both association rule mining and grids....
Parallel Rule Mining with Dynamic Data Distribution under Heterogeneous Cluster Environment
Big data mining methods support knowledge discovery on highly scalable, high-volume, and high-velocity data elements. The cloud computing environment provides computational and storage resources for the big data mining process. Hadoop is a widely used parallel and distributed computing platform for big data analysis and manages both homogeneous and heterogeneous computing models. The MapReduce fra...
Design and Analysis of a Dynamic Load Balancing Strategy for Large-Scale Distributed Association Rule Mining
Association rule mining is one of the most important data mining techniques. Algorithms of this technique search a large space, considering numerous different alternatives and scanning the data repeatedly. Parallelism seems to be the natural solution in order to be able to work with industrial-sized databases. Large-scale computing systems, such as Grid computing environments, are recently rega...
Fuzzy Associative Classifier for Distributed Mining
Distributed data mining extracts knowledge from distributed data sources without considering their physical location. The need for such systems arises from the fact that, in real time, many databases are distributed geographically in different locations. Often, transferring data produced at local sites to a centralized site for extracting knowledge results in excessive time and transmission cos...
Journal:
- Knowledge Eng. Review
Volume 21
Publication date: 2006